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VECTORIZAT1QN ON THE STAR COMPUTER 
OF SEVERAL NUMERICAL METHODS 
FOR A FLUID FLOW PROBLEM 

By Jules J. Lambiotte, Jr. , and Lona M. Howser 
Langley Research Center 

SUMMARY 

A reexamination of some numerical methods is considered in light of the new class 
of computers which use vector streaming to achieve high computation rates. A study has 
been made of the effect on the relative efficiency of several numerical methods applied to 
a particular fluid flow problem when they are implemented on a vector computer. The 
method of Brailovskaya, the alternating direction implicit method, a fully implicit method, 
and a new method called partial implicitization have been applied to the problem of deter- 
mining the steady-state solution of the two-dimensional flow of a viscous incompressible 
fluid in a square cavity driven by a sliding wall. 

The characteristics of the Control Data STAR computer have been used in this study. 
The timing of vector operations has been considered to develop order of computation con- 
cepts for the STAR computer. 

Results were obtained on the Control Data 6600 computer system for three mesh 
sizes and a comparison was made of the methods for serial computation. The methods 
were vectorized for the STAR computer and expected timings were used to compare one 
iteration of each vectorized version as a function of grid size. The methods which are 
explicit in form are shown to vectorize better than the implicit methods in the sense that 
they allow the use of large vectors in the computations. This advantage becomes less 
important as the number of grid points increases. Two implementations of che alternating 
direction implicit method are presented, one of which uses a proposed parallel algorithm 
for solving a tridiagonal system of equations. This algorithm is shown to possess unde- 
sirable characteristics with respect to the STAR computer. Another disadvantage cf the 
alternating direction implicit method, poor program locality in a paging environment, is 
pointed out and a possible solution is proposed. 



The introduction of toe Control Data STAR vector streai ing computer necessitates 
the reevaluation of many numerical methods presently being used on a serial computer. 

The relative efficiency of known methods may change when they are used on a vector com- 
puter. Also, nev methods will be, and have been, formulated for use on the advanced 
computers. The process of organizing the data and calculations within a numerical method 
so that the operations performed take advantage of the STAR vector instructions available 
is referred to as the vectorization of the method. This report presents the results of a £ 

study into the effect of vectorization on several numerical methods currently being used 
for fluid flow problems. Also included in the study is a new method proposed for the STAR 
computer. > 

A natural classification of finite-difference methods for a time -dependent solution to 
a fluid flow problem is either as an explicit or implicit method. An explicit method 
expresses the updated solution variable at each grid point at time t + At as a function of 
previously computed information. These methods are relatively easy to formulate but 
have the disadvantage of requiring a small time step to maintain numerical stability. An 
implicit method expresses a relationship between all or some of the solution variables at 
the updated time simultaneously; this gives rise to the necessity of solving a set of simul- 
taneous equations. The implicit algorithms normally have no stability restrictions in 
theory but are more difficult to use in an efficient manner. 

The two-dimensional flow of a viscous incompressible fluid in a square cavity driven 
by a sliding wall was chosen as a model problem. Both explicit and implicit methods were 
used on a serial computer (Control Data 6600 computer system) to obtain results for three 
grid sizes. The methods chosen for this problem were the method of Brailovskaya (BR), a 
two-step explicit method; the alternating direction implicit method (ADI); a fully implicit 
method (FI); and a new method by Randolph A. Graves, Jr., called partial implicit ization 
(PI). After obtaining results on the serial computer, these methods, with some variations 
and exceptions, were then coded for the STAR computer using a FORTRAN- like language 
which has vector instructions. Timings were then obtained based on estimates supplied by 
Control Data Corporation. These timing . give a sample of the effect of vectorization on 
the relative efficiency of the several methods. 

SYMBOLS 

ADI alternating direction implicit method 


1 


A, .,B,C- , coefficients in tridiagonal matrix 


! 


T'' UP— 

* 't 




method of Brailovskaya 


Dj j,Ej j»F,Gj j,H^ j coefficients of finite-difference equation for 
DV(M) degree of vectorization by approach M 


= *1+1,1 - *i-Ui 
°yi,j = ^i,j+i - ^i,j-i 
_ N Re At 

— r“i.i 

Np- -At 

dy i.i ■ — T _Dy *.i 

FI fully implicit method 

g amplification factor 

h spacing between grid points 

I column number in ADI formulation 

■ * 

i, 3 grid location 

J row number in ADI formulation 

K time step 

k order of the number of vector computations 

K. ,,I • ,,N. 1 , 0 . , quantities in. PI solution (eq. (19)) 

kpkg constants 

L number of results per clock 

l order of the average length of a vector 


length of vector 


3 


! 

i 


“f* 3 


a particular vectorization approach 


number of serial computations 


number of grid lines in each direction 


n(log 2 n - l) + 1 
log 2 n 

Reynolds number 


vectorization of a tank in which 0(k) vector operations are performed 
involving vectors whose average length is 0(1) 

method of partial implicitization 


Stone's algorithm 


repeated tasks 


right-hand side of tridiagonal system of equations 


vector startup time in clocks 


implementation of ADI using Stone’s algorithm 


vector timing in clocks 


velocity of sliding wall 


coordinates 


vorticity 


vorticity at intermediate step 


4 


h 2 


AtN, 


Re 


8y 2h 


„ _ 8* AtN Re 
°2 - -z. 


8x 


* 


2h 

stream function 


STATEMENT OF THE PROBLEM 

The problem chosen was to find the steady-state solution for the flow of a viscous 
incompressible fluid in a square cavity driven by a sliding wall as shown in figure 1. 

• Since it was the purpose of this report only to compare several methods when applied to a 
representative problem, this particular problem was chosen because of its relative sim- 
plicity and the availability of previous results (refs. 1 and 2). 

The governing equations are written in a time-dependem form and the solution pro- 
cess is a time-marching procedure to the steady-state solution. By introducing the 
stream function i//(x,y) and vortieity £(x,y), the governing equations become, after suit- 
ably nondimensionalizing and scaling the time by a factor N R£ , 




(1) 

5. +N - 

at Re [_W A 3 */ 

(g)(1)]- 2 ' 

(2) 

if/ = 0 

(for all walls) 

(3a) 

^ = 0 
9n 

(for stationary walls) 

(3b) 

_i 

an 

(for moving wall) 

(3c) 


The value of 100 is used for the Reynolds number in all computations. 

The boundary values for £ are computed from equation (1) and the boundary con- 
ditions for i//. (See refs. 1 and 2 and the discussion on pp. 6 and 7 for more complete 
details.) 


5 



. i 



SERIAL. SOLUTIONS 


Several methods were used on a CD C 6600 computer to obtain results. The unit 

square was divided into an equally spaced n by n grid network and the differential 

equations (1) and (2) were expressed in a finite -difference form. In all methods, central 

differences were used for the discretization yielding an o(h^ } spatial approximation to 

the original equations where h = — . For example, 

1 


^ )-*(*,- 4 *,, ) 

S(*t J |) + °< h2) 


W 


In the notation used herein (fig. 1), 

^i, j ~ >M iAx >J A y> = >Mih,jh) 
so that equation (4) becomes 



2h 


0(h2) 


(5) 


Similarly, the Laplacian operator becomes 



*1+1,J + + *i,J+l + *i,J-l " 4 *i,j 


h2 


+ 0(h2) 


( 6 ) 


For the purposes of this report, the Poisson equation (1) was considered to be an 
auxiliary equation and was solved in an identical fashion for each method; therefore, its 
solution time was not included in the timings presented. It was solved in all cases by a 
fully implicit method. This involved the solution of a positive definite banded system of 
equations and was achieved with a banded Cholesky decomposition scheme. 

Figure 2 shows the program flow chart. The initial was taken as n copies 
of a vector which was ordar of magnitude correct with the results of Mills {ref. 1) at the 
center line of the grid. For a given estimate of ? K , equation (1) can be solved for ^ 
(originally K = 0). Now < K+ 1 can be computed on the four boundaries. Let 
N = n + 1; then, the four boundary equations are as follows: 


6 


Right boundary. 


... 


Left boundary. 


- 2^1 « 
Cn , = ■■■»*■ 

S °,j h 2 


Lower boundary. 


5 M> = 


Upper boundary. 


ti.N = M* - ^.N-l) 


These equations are derived by assuming the existence of an imaginary point outside the 
boundary and using the governing equations at the boundary to eliminate it. Figure 3 
illustrates this procedure for the right boundary. 

Vi 1 

The computation of £ at the interior points is now performed by one of the 
previously mentioned methods. The finite-difference form for equation (2) is 


-K+l -K 


°yi,j A’i+i,j ■ ? i-i,A ij+i ■ ? u-i 


where 


? i+l,j " 2? i,j + ? i,j+l ? i,j+l " 3? i,j + ? i,j-l 


**1,3 = ^i»3+l - *i,M 


^.j = ^i+1, j " *i-l,j 


The time superscript has been deliberately deleted from the vorticity ? since this is a 
function of the various methods. All \{/ values are assumed to be 
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Brailovskaya's Method 

Brailovskaya proposed a two-step method in which intermediate vcrticity values 


ZK+1 


j are computed from equation (8) by using ? from time step K (see, for example, 

ref. 3) and then the intermediate values are inserted into the convective terms. The equa- 
tions for this problem become 

■ ■ c £i,i) ■ ( Dx i,i)( ? w-i-i - ?5-i) <"• 

■ tt 1 - ^ $ (&I.I * - < * *♦. ^ <&-») 

- - *£&) - w» 

Brailovskaya's method (BR) is an explicit method and is stability limited. Carter 
(ref. 3) has analyzed the stability of BR on the Navier-Stokes equations. Adapting the 
present problem to his analysis yields the stability criterion 

At 5 0.205h 2 

The method itself is comparatively simple to implement. Note that the work 
involved is o(n 2 ) per time step since equations (9) and (10) are evaluated for each of 
the n 2 grid points. 

Alternating Direction Implicit Method 

The alternating direction implicit method (ADI) uses two difference equations at 
each point in alternate sweeps through the grid (ref. 4). At t = 2K + 1, equation (8) is 
written, a row at a time, with spatial derivatives implicit in the x-direction and explicit 
in the y-direction. Thus, equation (8) becomes 

-2K+1 -2K . At|7i.2K+l 0 -2K+l ^ -2K+f\ A.2K „.2K A .2K \ 

? i,l = + ^[yi+1,5 " 2? i,j + C i-l,j; + ^i.,+1 " 2? i,j + ? i,i-lj 

N Re L 2K+1 J.2K+ 1\ ^ f„2K „2K \] 

- -^rj Dy i,j(fi + i,j - *1-1 ,, )- -i,^i, i+ i - ? i,j-lJJ (11) 


3 




$ 


T 











2 

Multiplying fay h and gathering terms yields from equation (11) a system of equations 
for each horizontal line of points in the grid. Now, let y = J Ay for the Jth row. Then, 


B 

C M 



B 

C 2^I 


A 3J 

B 


-2K+1 

,2K+1 

J 




( 12 ) 




c n-l^I 

B 


-2K+1 

‘■n.J 



where if 


dI i,i 


4 M 


= 


_ N Re At 


(12a) 

(12b) 


then, 

B = -(2 At + h ? ) 


(12c) 


C i,J = At ' dy i,J (12d) 

A i, j = At + d yi, j ( 12e > 

*i,J * fi,j( 2it - h2 ) * + ? J5-l(- At + <6t l,j) < 12I) 

Both Rj j and R. n have an extra term since differencing about ^ j and £ n j 
includes the known values on the left and right boundaries, respectively. They are modi- 
fied from equation (12f) as follows: 


~ R 1,J * 1 A l, J 

o _ d _ w2K+ 1 p 

K n,J _K n,J ? n+l,jSi,J 


(12g) 

(12h) 
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i . , ■ I 
I .! 


This system o£ equations is often solved by Thomas' algorithm, which is equivalent 
to a Gaussian elimination factorization of the matrix without pivoting. The steps of the 
algorithm, dropping the J subscript for simplicity, are 

w l* B "I 


P 1 " C l/ W l 
^1 " R l/ W l 


" B “ A j P j-l 
R. - AjQ. t 


P j = C fl"i 


(J = 2, 3, • • n-1) 


Then, 


* B - Wl 

O - R n ~ V^n-l 

Tl ~ W_ 


*n = «n 


" Q J • P ft+1 


(j = n-1, n-2, . 


For each row of the grid ( J = 1, 2, . . n), a similar structured system is generated 

which is similarly solved. Each row is solved independently of the other. This fact is 
taken advantage cf when the solution process is set up for the STAR computer. 

When the direction for the next time step 2K + 2 is alternated so that the implicit- 
ness is in the discretization of the derivatives in the y-direction, the following equation, 
which is similar to equation (8), is obtained: 


10 


r 






/ 


-2K+2 _ -2K+1 . Atf?-2K+1 or 2K+l ^ -2K+1N ^ /-2K+2 --2K+2 -2K+2\1 

%i ~ + ^2 Ll ? i+i,i ' 2? U + J + " 2? i,j + ? i,J-l JJ 


N Re At 


4h 


L /.2K+1 r 2K+l\ _ A. 2] 

py^Ji+M 


A.2K+2 -2K+2\T 

j+1 " h,i-l )] 


Now, the equations for the Ith column (that is, x = I Ax) become 
B Cl,i 

| *1,* 8 C! j 

Ai ( j B 


^.n 


wheie using equations (12a), (12b), and (12c) 


C I,J = At + 

A l,j e At " dx l,j 







$? 2 


Bi.a 


• 

at 

• 




Bl,n_ 


R i,i - *u - ^o +2 A U 

■r _ r _ ,2K+2 „ 
K I,n “ *T,n ? I,n+l '“I.n 


(14) 


(15) 


(15a) 

(15b) 


B « - 5?J +1 ( 2it ' ***) * * 5l- K M (- At - ’’’’l,)) (!5o) 

Modifying as previously yields 


(15d) 

(15e) 


The amount of computation involved in the solution of equation (12) by Thomas' 
algorithm is O(n) and, hence, the computation per time step for the n systems is 
O^n^). The changing of directions presents added programing complexity but the alterna- 
tion of direction is necessary since it is this process that gives the unconditional stability 
after two equal time steps. 

Performing a linearized stability analysis (ref. 4) for this problem results in 


-2K+Z _-2K 

? i,j “ g? i,j 


i 

! 


11 


The amplification factor g is given by 


^ 1 - 4p sin** — -j - 2ic sin Ax ^1 - 4p sin* 5 + 2i0g sin kg Ay 

^ 1 + 4p sin^ “^g - } *" 2io i s * a Ax ^1 + 4p sin** - 2iOg sin kg Ay 


2^2 Ay\ 

sin^ -=^ — J + 2io-g sin kg Ay 


where k^ and kg are constants and 


P-^ 

h 2 


<r -** AtNRe 
1 Sy 2h 


_ AtN Re 

2 3x 2h 


The Von Neumann condition for stability is jg{ < 1. This condition is satisfied since g 
is composed of two factors and each factor is of the form 

f = a+Jb 
e - ib 

where |a| i |e[. Hence, it can be shown that |f | £ jl j. 

Fully Implicit Method 

For the fully implicit method (FI) the values of j in equation (8) are taken at 
time K + 1; this results in the following equation for the i, j point: 

r» »K+1 £. -K+1 , »K+1 . rr *K+1 , r-. >»K+l _ u 2i»K /<a\ 


«5J * «u - °u «3?, * «u *S3 - ' cjj 1 - -*$, m> 


where 


D i,J ' At * d >'u 
E i,j - At - *t,i 


G i,J " “ + "%! 
H i,i ' at - d \l 


F = -4At - h z 


Equations (16) satisfies the Von Neumann condition for stability since here the amplifica- 
tion factor g is 


g = r (17) 

/ k- Ax kg Ay\ / \ 

1 + 4p Isin 2 — - — + sin" 2 — - — J + i 1 20j sin kj Ax - 2ffg sin kg Ay J 


and clearly Igl 5 1 since g is of th .* form g = — - — where a = 1. 
1 ' a + ib 
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The use of FI introduces several computational burdens. The resulting matrix is 
n 2 by n 2 and when a row by row ordering scheme is used, it becomes a banded matrix. 
This matrix is said to have a bandwidth of n and when banded programing techniques are 
used, the order of computation is o(n^) and the required storage is o(n 2 ) which is 
considerably higher than that required by ADL It should also be noted that although the 
band itself is sparse, it quickly fills so that after the elimination of the first n variables 
(one row) the submatrix for the next n variables is full. Thus, it is not possible to take 
advantage of sparsity within the band. Figure 4 shows the band structure and the fill that 
occurs after the first n variables are eliminated. 

The reasons FI was considered in spite of these disadvantages are as follows: 

(1) Recent advances in sparse matrix theory reduce these computation and storage 
figures to o(n^) and o(n 2 log 2 n), respectively. (See ref. 5.) 

(2) The solution procedures for the tridiagonal systems in ADI appear eo '% 

nonvectorizable. 

(3) It was considered possible that FI might have better convergence properties 
than ADI (require fewer steps to reach steady state). 


Method of Partial Implicitization 

The method of partial implicitization (PI) has recently been proposed by Graves 
(ref. 6). hi this method he has been able to express explicitly in terms of past 

information and at the same time retain the stability characteristics of a fully implicit 
method. 

The derivation for the stated problem proceeds in the following manner. Observe 
from equation (16), which is repeated for convenience, 


d r K+1 
D i,S M-1.J 


w i.K+1 f-% -K+l 

E i,j ? i+l,j G i,j ? i,j+l 


+ H. 


K+l 




, r. r K+l _ . 2 r K 
+ FC U ~ - h 


that the five grid points included in the general equation form the familiar star or cross 
pattern with ■ at the center. Based upon the presumption that it is these four neigh- 
bors that exert the most influence upon the solution at that point, the general equation for 
each of the four neighbors is also written. References to grid points within the star for 
j are made implicitly, whereas those outside are expressed explicitly. The resulting 
five equations can be expressed in matrix form as follows: 


i 


G i,j °i,j 


° ° K i,J 

0 0 't K -i, 1 j h,, 

e m H i,i 'u 1 ■ M i,i 

F 0 \j 

0 F ? M-i °i,i 


where 


v _ u*r K n r K v r K n r K 

~ ~ h C U+1 ' U i,j+1 M-1J+1 ‘ E U+1 M+1,J+1 " M,j+2 

Hi = " h2 ^ 1,1 ‘ D i-l,j ' G i-l,j ? £l,j+I ‘ 


M. . = -h^f 


N i,j • - h2 ^ + i,j ' E i + l,j «* «,j - G i + l,j £i, J+ i - H i + l,j &1J-1 
°i,i = _h2? U-l ‘ ‘ E i,J-l C S-l,j-l ' Hj-l 

V , 4 

Matrix equation (18) is solved for j by using Cramer’s rule. The result is 

;K+1 _ ™1,J - ( N i,i E i,i * °i,i H l,i * L l,i D i,i * K i,) G U ) (i9) 

1,1 f2 - Ki D i +1 ,i «i,j g u-i * D i, f * g u 

Equation (19) can be used for all the interior points except the points adjacent to the 

boundary. Although it is possible in such a case to simply remove the equation for the 

K+ 1 

boundary point from the system and rederive the expression for £. - , it is desirable 

for the vector operation of the STAR computer to maintain formula (19) for all points. 

This objective can be accomplished by modifying some of the appropriate constants in 

matrix equation (18). For example, let t. , be a point adjacent to the top boundary. 

K+ 1 

Then is known. The resulting four equations to be solved can be expressed in 

the same format as matrix equation (18) as follows: 


i 




" 1 

0 

H U*i 

0 

o" 

■;k+i' 



0 

F 

E i-l,i 

0 

0 

rK+1 

M-1,1 



G iiJ 

D u 

F 

E i,J 

H i,J 

-K+l 

M,j 

* 

M U 

0 

0 

D i+l,J 

F 

0 

-K+l 

M+1.J 


N i,i 

0 

_ 

0 

G U-1 

0 

F 

-K+l 


_°i,i_ 


( 20 ) 


where the following changes axe made to arrays of coefficients and right-hand-side con- 
stants given in matrix equation (18): 


First, 


then, 


M i,i * M i,J ' G i, 1 


= 0 

H i,J+l = 0 
K i,j " 


(20a) 

(20b) 

(20c) 

(20d) 


The system of matrix equation (20) is now correct for the point near the boundary 
and equation (19) can be used for this point also. Similar logical changes can "be made for 
points near the other boundaries and near the corners. The amount of computation for 
PI is o(n^) but requires about twice as many operations as ADI; however, the explicit 
nature of PI allows it to vectorize much better than ADI on the STAR computer and, in 
fact, the simplicity of the PI form may make it popular on a serial compute-. 

The stability of this method has been verified for the two-dimensional heat equation 
and by Graves for Burgers’ equation. The grid sizes run for this problem showed no sta- 
bility constraints and in fact demonstrated a seemingly complete insensitivity to At. 


Results of Serial Computations 

The results from the comput dions done on the CDC 6600 computer using the four 
methods are presented in table L The best results are reported for each method. The 
following observations can be made regarding these results: 
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(1) Despite the fact that the FI took fewer total steps to reach convergence, its 
large computation time heavily negated this slight advantage. Since it did not show any 
great advantage over the other two stable methods, it will be omitted from consideration 
in the rest of this report. 

(2) ADI and BR took approximately the same amount of time per step and PI 
required about twice that amount. 

(3) Neither PI nor ADI greatly reduced the total number of steps to convergence. 
On the average they took about one -third as many steps. Interestingly, PI had the 
characteristic of being insensitive to the size of At selected if At was greater than 
some number. For any At larger than this value, steady state was reached in the same 
number of steps. ADI always reached a point at which a larger At would cause the 
results to diverge toward infinity. 

(4) As n doubled, the ADI, BR, and PI methods required about four times as 
many steps. 

(5) Of the three methods under consideration, ADI performed the best and PI and 
BR performed about the same. 

VECTORIZATION OF THE PROBLEM 
General Characteristics of Vector Timing 

The STAR computer obtains an effective increase in computational speed by stream- 
ing consecutively stored data from the memory, through pipeline processing units, and 
back to memory so that the elapsed time between the production of successive results is 
much less than the time frcm beginning to end of any one computation. This process 
requires that the data be organized into a "ector format (that is, stored in consecutive 
locations in memory) and that the computations use STAR vector instructions. 

Since a comparison of different vector implementations is desired, it is necessary 
to first look at the general liming for a vector operation and understand its implications. 
The general form, given in clocks (1 clock = 40 nanoseconds), is 

it 

T = S + y- 

Lt 

where 


T 

• time for operation, clocks 


s 

startup time, clocks 

4 

V 

length of vector 


L 

number of results per clock 
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Some representative STAR timings are given in table IL These timings are based on an 
unpublished preliminary STAR timing summary (Aug. 1£> >2) supplied by Control Data 
Corporation and, inasmuch as they are preliminary timings, are subject to change. 

The startup time represents an inefficiency in the use of vector instructions. 
Obviously, the timing is best for a particular computation if it is performed with long 
vectors and few startups, if possible. 

Since the timing for a computation depends not only upon how many results are gen- 
erated but also on whether they were performed with a few long vectors or with many 
short vectors, it is convenient to introduce some notation and definitions to describe the 
vector implementation in an order sense. 

Presume that there is a. computational task to perform which has associated with it 
a parameter n which in some way characterizes the size of the task. In discussing 
quantities related to the computation of the task, the concept of order of magnitude at 
infinity with respect to n is used. Specify that f(n) = 0(g(n)) (read as "f and g 

are of the same order of magnitude for large n") if lim ^22 exists and lim 2^2 = k 

n— g(n) n-°° g ( n ) 

where O < K < «. 

For compactness of notation n is suppress ed and f = O(g) is written if the param- 
eter involved is clearly defined. Note that in the statement f = O(g), g is not uniquely 
defined by f; that is, 5n 3 + 6n 2 + 12 = o(sn 3 4- 6n Z ) = o(5n 3 ) - o(l0n 3 ) = o(n 3 ). 
hi the examples given herein, the simplest expression is always used; that is, 

5n 3 + 6n 2 + 12 = o(n 3 ). 

Consider now the implementation of the presumed computational task. In the fol- 
lowing discussion, m, l, and k are assumed to be functions of n, and the usual imple- 
mentation that carries out the computations on a serial computer oi 1 on the STAR com- 
puter without vector operations (referred to as the scalar mode) is assumed to require 
0(m) calculations. There may be many vector implementations (vectorizattons) of the 
task and it is assumed that a particular vectorization M requires O(k) vector opera- 
tions whose average length is 0(1). The vector order of computation of such a vectoriza- 
tion M is denoted by (k). 

Definition : The degree of vectorization by vectorization M, denoted by DV(M), is m/k. 

Again, since m and k are not unique. DV(M) is not unique, and DV(M) can 
be any f(n) which is 0(m/k). As before, the simplest form is used. 

On an ideal parallel computer (one which has an infinite number of processors 
which operate in parallel), DV(M) could be called the speedup x itio. On the STAR com- 
puter, DV(M) is only an indication of the speedup. 
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Definition : Vectorization M is a consistent vectorization if Zk = O(m); otherwise, 
it is called inconsistent. 

An inconsistent vectorization is hence one that produces a higher order of total 
results than the serial algorithm it replaces. Since the high-order term for the timing 
of the scalar algorithm is k^m and for the inconsistent algorithm is kgZk ( for some 
constants kj and k^), it is guaranteed that as n — 00 , the scalar algorithm becomes 
better than the vectorized algorithm. Naturally, the value of n for which this happens 
depends upon the coefficients mvolved and the relationship between kZ and m. There 
are examples of parallel, algorithms proposed which, if vectorized, would not be consist- 
ent. They are designed with the ideal parallel computer as the model and assume that 
timing is proportional to the number of vector operations involved and independent of the 
length of the vectors. This assumption is not valid for the STAR computer. However, 
there may be regions of vector lengths where these algorithms can be useful but with 
reduced gains. The recursive doubling algorithm of Stone (ref. 7) is an example of an 
inconsistent vectorization and is discussed in more detail subsequently. 

A consistent vectorization is considered to be optimal if mere is no other consistent 
vectorization of the task whose degree of vectorization is of higher order. Certainly this 
is true of vectorization M if DV(M) = m. It should be emphasized that the term 
"optimal'* refers only to the vectorization of the particular task in question. If that task 
is only a part of the overall solution procedure, then an approach in which that task vector- 
izes optimally need not be the best approach to take. For example, if the vectorization of 
several iterative methods is considered, and specifically the vectorization of the computa- 
tions invol ed in one iteration of each method, it is possible that one iteration of a method 
with poor convergence rates can be vectorized optimally, whereas a method with good 
convergence rates may have a vectorization of lesser degree. The method with the opti- 
mal vectorization is not necessarily the best approach to use since it requires more 
iterations. 

To illustrate the use of these terms, consider the vectorization of two approaches 
for adding two n by n matrices, namely, 

C = A + B 

Vectorization Let each column of the matrices be a vector. Then, n vector adds 

of length n yield 

T i = n ( 33 + f) = 33n + T 

This vectorization is O n (n), and DV^M^ = ^- = n. 
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V ectorization Treat the whole matrix as one vector of length n^. Then, 


2 

This vectorization is 0^(1), and DV^Mg) = — = n 2 . Note that both M^ and Mg 
vectorizations are consistent and Mg is optimal. 

The following features about timing are illustrated by this example. 

(1) The serial ordtr of computation o(n^^ is reflected in both timings. 

(2) Vectorization Mg is superior to Mj because it has fewer startup times. 
Note, however, that this difference shows up in a lower order term of the timing. There- 
fore, as n the two vectorizations become essentially equivalent in an order sense 

T 1 

since — — — 1. For smaller values of n, the lower terms are more important and 


(3) Since scalar timings for an c(n^) task would be T g = kn^, it has no lower 
order terms and for small values of n could be competitive with vector operations. 

(4) For consistent vectorizations, DV(M) is a meaningful rough comparison of the 
vectorization in terms of computations. 

In many applications it will be possible to specify or describe the efficiency of the 
vectorization of subtasks of the total problem but very difficult to determine the best 
approach to the overall solution of the problem. For instance, in the problem in this 
report, the emphasis has been on comparing the vectorization of the task of advancing the 
solution O” . step in time. The methods can be compared on this basis, but to specify, in 
general, i best method when one considers the total number of steps required for each 
method is difucult, if not impossible, and usually quite problem dependent. Therefore, 
results have been given in terms of what is reasonably constant for most problems of 
this type, namely, the computation time require ’ to perform one step toward steady state. 
Results will be given for total solution time for the three grid sizes computed on the CDC 
6600 computer. 

The vectorization of the various methods for this problem provides an excellent 
cross section of the philosophies involved in the selection of a method for the STAR com- 
puter. First, there is BR which runs slowly on a serial computer but vectorizes opti- 
mally. Then, tnere is ADI which outperforms BR serially but seemingly does not do 
well on the STAR computer because of the serial nature of the solution of a tridia?onal 
system. Two vectorizations of ADI are presented in this report. The first implemen- 
tation ST utilizes a new parallel algorithm by Stone which was formulated iu solve each 
tridiagonai system in a parallel fashion. This vectorization is found to be inconsistent. 
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The second implementation invclvos no new mathematics but only the recognition that the 
n tridiagonal systems are identical in form and independent of each other. Thus, the 
standard algorithm of Thomas can be used to solve all n systems at the same time. 

This approach is referred to as RT 'or repeated tasks. In this implementation, the 
importance of data storage in STAR is emphasized. Finally, there is PL This stable 
method is particularly well suited for the STAR computer since the solution is computed 
by an explicit -type formula which vectorizes very effectively. The vectorization of PI 
and BR demonstrates the importance of control vectors. 

Each vectorized version is compared with the others through estimated STAR 
timings. The relative speeds of the /tutorized versions are compared with the relative 
speeds of their serial counterparts to demonstrate the effect of vectorization on the meth- 
ods and also to quantify some of the order concepts developed earlier in the report. 

Assumed FORTRAN Extensions 

The programs presented are coded in a FORTRAN -like language with extensions for 
Vector operations. Since the language which will actually be used has not been finalized, 
the code given here is only presumed to be representative of the final version. Several 
liberties will be taken with the code in order to make it more readable. These will be 
pointed out. A description of the FORTRAN extensions used follows. 

Implied DO 

A sequence of elements from an array A can be specified by an implied reference 
A{M1:M2:M3) where Ml, M2, M3 have the same meaning as they do in the DO state- 
ment DO 50 I = Ml, M2, M3. All vector operations must involve consecutive loca- 
tions in core. Therefore, it is presumed that the reference A(I:J), which represents 
A(I), A (1+1), . . ., A(J), will generate vector operations, whereas A(I:J:2), which repre- 
sents A(I), A(I+2), . . ., A(J), will generate scalar code. It is also possible to use 
implied DO references within multidimensional arrays as long as the reference occurs 
only in one of the indices, for example, A(I,1:M) or A(1:N,J). Note also that if the 
STAR computer stores arrays by columns first, then the latter reference is to consecutive 
locations and therefore can be considered to be a vector, whereas the former reference 
cannot. 

BIT 

BIT is a type statement identifying a variable or array of variables each to be one 
bit long. 

CTRL 

The STAR computer has in its hardware instruction set the capability to use a con- 
trol vector with its normal vector instructions. The control vector is a string of bits 
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where each consecutive bit corresponds to a consecutive element of a vector generated or 
computed in some vector operation. If a bit is a 1, the corresponding element of the 
result vector is stored. If the bit is a 0, this computation is not stored but merely dis- 
carded. The assumed FORTRAN code to use this feature will be 

A = B .CTRL. (EX) 

where EX is an expression giving rise to an array or vector of results, and B is a bit 
vector declared in the BIT statement. As an example of its use, consider the following 
program: 

DIMENSION Al(6), A2(6), C(6) 

BIT B(6) 

C = B .CTRL. (A1 + A2> 

END 

Assume that Al, A2, C, and B have the following data before computation: 

C = [2,2,2,2,2,2] 

Al = [3,3,3,3,3,3] 

A2 = [4,4,4,4,4,4] 

B = [l,l,0, l,0,l] 

Then after computation, 

C = &, 7, 2,7,2, 7] 

This feature is desirable in boundary value problems because it allows one to include 
the boundary points in the arrays of variables and thereby form a vector which includes 
all grid points. One can then use this long vector in an expression which is valid for the 
interior points, but not for the boundaries, and yet which does not destroy the good infor- 
mation at the boundaries by overstoring it with a quantity computed using an invalid 
equation. 

Two-dimensional arrays are assumed to be stored by the STAR compiler by columns. 
For clarity in reading, the capability to reference a two-dimensional array as if it were 
singly dimensioned is assumed. 


Vectorization of Brailovskaya's Method 
The equations for BR are of the form 

1 - F i,J + D i,j £l,J + E i,i &1.J + \i £j-l + G i,j 
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Since each value of vorticity and each coefficient used in the right-hand side of the 
equation are the result of a calculation at the last time step, it is possible to perform the 
calculations with vectors of length n^ by using the following procedure. Let Z be a 
FORTRAN array containing the values for tune K. First, compute the coefficients F. j, 
using vectors of length n^, and store the result in the array F. Next, perform the vector 
multiplication F*Z and store the result in the temporary vector Tl. Similarly, compute 
D*Z with the proper offset in Z and store in T2. Then, add Tl and T2. Continue in 
this fashion until uie entire equation has been computed. This vectorization is obviously 
0^(1) and is an optimal vectorization for the task. The coding in appendix A does not 
perform the computations in precisely this order since it is possible^ take advantage of 
similarities in the two steps of BR. 

The STAR FORTRAN coding uses vectors of length r? with the storage arrangement 
as shown in figure 5(a), where the elements are stored consecutively by columns beginning 
In the lower left-hand corner. The grid as shown in the figure includes the boundaries so 
that all the information needed to compute the interior points is contained in the vector. 

All the vectors needed in the computation are used in this manner and since it is necessary 
to give the starting and the ending location of the vector, the proper offsets must be com- 
puted for the vector instructions. This is done at the beginning of the coded example. To 
compute the first result, Z(MC), four points are needed. The subscripts of these points 
are used as the beginning subscripts for the implied DO notation. The last result com- 
puted is Z(NC). The subscripts of the points it needs are used as the ending subscripts 
in the implied DO notation. The results are computed in order from Z(MC) to Z(NC). 

A bit control vector is used to prohibit storing results on the boundaries. The bits 
in the control vector corresponding to the boundaries have the value 0, and the remaining 
bits corresponding to the interior points have the value 1. (See fig. 5(b).) 

The STAR FORTRAN coding for the Brailovskaya method is given in appendix A. 

Vectorization of Method of Partial Implicitization 

Since the general equation for PI has the same form as for BR, it is again an 
0,(1) vectorization and, hence, is optimal. In order to maintain an order of n^ vector- 

n* 

ization for the method of partial implicitization, it is necessary to use equation (19) for the 

computation of all the interior points. In the evaluation of K, ,, L. ., N, ., and O. . 

. *>J *>J 

appearing in equation (19), the points j + g , ?^_2 j> %i + 2 311(1 j-2 are neede< l* 

For . adjacent to the boundaries, these points do not exist; therefore, two columns are 

J 

appended to the original grid points, one on the left side and one on the right side. Now 
the vector contains an appropriate number of elements and thus equation (19) can always 
be used without referencing nonexistent points. It does not matter what the contents of 
the two appended columns are because, as noted earlier, for the points adjacent to the 
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boundaries, equation (19) is modified as described by equations (20a) to (20d) so that the 
terms containing the nonexistent points are multiplied by a factor of 0 or are reevaluated. 
In the FORTRAN program, the terms D, E, G, H, K, L, M, N, and O are all 
evaluated by using vector instructions of length n^. Next, the modifications are made 
according to equations (20a) to (20d). Some of the modifications can use vector instruc- 
tions of length n, whereas others will be scalar code. Then, equation (19) is computed 
by using vector instructions of length n2. 

The STAR FORTRAN coding uses vectors with the storage arrangement as shown in 
figure 6(a), where the elements are stored consecutively by columns beginning in the lower 
left-hand corner. All vectors have the same length even though they may not be filled 
completely. This is useful when computing subscripts because corresponding elements 
in the vectors have the same relative locations. To compute each result, the twelve 
closest points are needed in the equation. The results are computed in order beginning 
with Z(MC) and ending with Z(NC). 

A bit control vector is ured to prohibit storing the results on the boundaries. The 
bits corresponding to the boundaries and the two appended columns have the value 0 mid 
the remaining interior bits corresponding to tne interior points have the value 1. (See 
fig. 6(b).) 

The boundaries are computed as they are in Brailovskaya’s method; therefore, the 
code is not repeated in this example. The STAR FORTRAN coding for the partial implic- 
itization method is given in appendix B. 


Vectorization of Alternating Direction Implicit Method by Repeated Tasks 


As stated previously, the use of ADI gives rise to n different systems of equa- 
tions, each tridiagonal and each independent of the results of the others. The task of 
solving any one of these systems is serial in nature as evidenced by the recursive nature 
of equations (13a) and (13b). However, it is possible to obtain a degree of parallelism by 
noting that each task has n-fold repetitiveness in that the operations required to solve the 
first system are repeated for the other n-i systems. Therefore, by correctly arranging 
the coefficients in storage, Thomas’ algorithm can be used in the vector mode. For 
instance, if is assumed to be a STAR vector composed of the following elements (see 
fig. 7) 




(l* 1,2,. . n) 



and the same is done for A., Dp then Thomas’ algorithm can be used in the same form as 
equations (13a) and (13b) except that each operation is now a vector operation of length n. 
This vectorizaiion is then on O n (n) vectorization of an o(n^ ) task. 

To actually implement this idea, one has to be certain that the coefficients tliat are 
to be used as vectors are stored consecutively. Since the coefficients must be computed 
and each is the result of an operation on elements of indexed arrays, it seems possible and 
is desirable to compute and store the coefficients by using vector operations. 

In order to illustrate the importance of making the correct decision about data organ- 
ization in vectorizing the prcftiems, consider the specific grid in figure 5(a) for n = 5. 
Assume that the FORTRAN vector PSI and ZETA contain 


PSI = {Vj, i f/ 2 , & 3 , • • ^ 49 ] 

and 


ZETA = ? 2 , ? 3 , . . ., ? 49 ] 


and that the first step is implicit in the y -direction. Then, for example, 
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wheie CON = N Rg * At/4. 

However, none of these operations are vector since the xp'e in the operations indi- 
cated are not stored consecutively. However, if the step is taken implicit in the 
x-direction, the following computations for are obtained: 
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All these operations are vector operations for p stored as indicated. The con- 
clusion is that for the assumed implicitness in the x-direction, it is necessary that the 
two-dimensional array of p and £ variables be stored by columns of the grid. The 
opposite is true when the implicitness is in the y-direction. This last fact forces a 
rearrangement of the PSI array each time the direction is alternated. The rearrange- 
ment is a fairly expensive task but is necessary in using this vectorization of the ADI 
method. Note that in the discussion of Stone's algorithm, the opposite correlation between 
direction of implicitness and direction of storage is desirable. 

It should be pointed out that the rearrangement of the vectors is not only expensive 
computationally, it alro could be slowed considerably because of the paging system of 
storage in the STAR computer. Information is brought from the disk to core in pages. 

II, in one sweep, a row of vorticities is on one page, then a column is on many different 
pages. The necessity to bring many pages in and out of core to reference a column repre- 
sents an overhead to the rearrangement that is not shown in the vector timings and could 
be quite significant. 

The following comments help to make the STAR FORTRAN coding for the ADI 
method presented in appendix C more readable: 

(1) Ry inserting just a small bit of logic and changing a few signs, essentially the 
same code can be used for the solution in both directions. 

(2) Two arrays DERI and DER2 are used to store the p derivatives. 

DER2(I) = PSI(I + 1) - PSI(I - 1) * CON 
DERI (I) = PSI(I + N) - PSI(I - N) * CON 


where, when implicit in x-direction, 


CON = 


N Re At 


and when implicit in y-direction, 

CON = - -fi eAt 
4 


Therefore, when implicit in x-direction, 
Np h At 

DER2(I) = P y (l) 

N R h At 

DER 1(1) = p(l) 
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After the rearrangement of the if/ vector for the implicitness in the y-direction 
■No h At 

DER2(I) = 5^.— ^ X (I) 


DER1(I) 


-N r h At 

- 5 V-’V I > 


(3) When the ^’s are stored by columns of the grid, the DER arrays do not 
include the left and right boundaries. The top and bottom values are computed only 
because it is desirable to do the computation on long vectors (vectors which include all 
the interior grid points). 

(4) All two-dimensional FORTRAN arrays are assumed to be stored consecutively 
by columns so that when they are used in computation, only implied DO in the first index 
generates vector code. 

(5) Since Thomas’ algorithm in the scalar form only requires A^ + p Cb + p R^ + ^ 
after computation is finished with A., Cp Rp it is possible to have just one A, C, and 
R vector at each step of the algorithm. 

(6) In Thomas’ algorithm; Qj and P- are computed with a division by Wj. Since 
vector division is slow compared with multiplication, the two divisions have been replaced 
with two multiplications by 1 j-w.. 

Vectorization of Alternating Direction Implicit Method 
Using Stone’s Algorithm 

Stone has proposed a parallel algorithm (RD) for the direct solution of a tridiagonal 
system of equations. He notes that in the factorization of the matrix A into the product 
of a lower triangular matrix L and an upper triangular matrix U, the resulting equa- 
tions are recursive and of the form x. = b^x-_ ^ +■ c^x -_2 for the factorization and 
x i = ^i x i-l + c i ^ or the f° rwarc * and hack substitutions. These operations involve O(n) 
calculations when done serially. Stone uses recursive doubling to perform the calcula- 
tions in log 2 n vector operations. Recursive doubling is the effective subdivision of 
work in a task into subtasks which have similar form. A simple example is the calcula- 
tion of the sum of n numbers. This is merely x n in the sequence given by 
X 1 = a l» x i = x i-l + a i i = 2, 3, . . .. n. The calculating sequence is illustrated 

as follows for n = 4: 

Initially, 



Step 2, 




The computation of the vector order of calculation provides some interesting results 
and are presented here for the summation problem. Let n = 2*. The first step of the 
parallel algorithm is a vector add of length n - 1; the second step is an add of length 
n - 2; the third step is an add of n - 4; and the kth step is a vector add of n - 2 k “l. 
There are £ = log 2 n such steps. Therefore, the average vector length is given by 


dr£ n ' - ( 2t - *)] 


log 2 n L 


1 / \ n ( l0 &2 n " l) + 1 

= — i — (n logo n - n + 1) = - 

Irtrr n ' ® Inrr n 


log 2 n 


log 2 n 


Now as n gets increasingly large, — n. Thus, the vector order of computation 
is O n (log 2 n) for a task which is O(n) serially. This means that work which is 
O^n log 2 nj is being done in the vector mode and, although each computation is being done 
more quickly in the vector mode, there will be some value of n for which this approach 
is not beneficial and can be beaten even with scalar coding. 

Table in contains the estimated timing for the solution for one tridiagonal system 
with N equations using scalar coding and recursive doubling. Scalar coding is seen to 
be faster than the parallel algorithm for very small systems (N <32) and for very large 
systems (N > 8192). This is not very surprising. For large values of N, the O^N log 2 N) 
order of work in the vector mode takes longer than the O(N) scalar order of work. When 
the vectors are short, the startup time for the vector operations is important and accounts 
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for the ratio in that region. Even in the region where Stone’s algorithm is faster, it is 
only slightly better since once the vectors get long enough to make the startup less impor- 
tant, the higher order of work is being felt. It is concluded that although the recursive 
doubling approach is probably effective on a computer such as the ILLIA.C IV, its advan- 
tages on the STAR computer are much less. 

The recursive doubling formulas for the solution of the tridiagonal system are not 
easily presented. A detailed analysis is presented in reference 7 as well as a FORTRAN- 
like algorithm for carrying out the procedure. That algorithm has been used in appen- 
dix D with the assumption of a capability of zero and negative indices. 

An interesting feature of the vectorization process for this approach is that the 
desired relationship between direction of implicitness and the storing of the variables is 
the exact opposite from that desired for the RT vectorization. Here, since vector oper- 
ations involve the coefficients of the particular tridiagonal system being solved, it is 
desired that the unknowns for that line be stored consecutively. For example, in solving 
implicitly in the y-direction, it Is necessary that PSI = <^2» *3> * * •» *49) 

because here the vector 
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and all computations involve STAR vectors for PSI as indicated. 

Results of Vectorized Methods 

Table IV gives a summary of the vectorizations for the four methods. BR and PI 
both 'ire optimal degree vectorizations. Of the two ADI vectorizations, clearly the 
veutorization using repeated tasks RT is better. Table V gives a summary of the 
timings for the several vectorized methods. The graph of these formulas, as a function 
<»£ n, appears in figure 8. Since BR was the minimum for all n, each time has been 
normalized to a value of 1.0 for BR. Again, it should be emphasized that this graph 
.•efiects only the amount of time required to perform one step of the various algorithms 
„nd does not include convergence rates. 
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The graph reflects the following interesting concepts related to the vectorizations: 

(1) The ratio of the two ADI methods to BR varies with n. This type of varia- 
tion does not exist for a serial computation where the ratio is a constant. This variation 
is, of course, due to the different degrees of vectorization obtainable in the methods. 

(2) The greater values of RT, when n is small, are due to the poorer vectoriza- 
tion in the method. Recall from an earlier discussion that the effect of the degree of 
vectorization on timing is most significant when n is small. 

1 (3) As n — °°, the value of RT/BR becomes nearly constant. This constant is 

| just the ratio of the high-order terms in the timing. (See table V.) 

(4) The ratio PI/BR is essentially constant since both PI and BR are r? 
degree vectorizations. 

(5) ST has the general shape of the tridiagonal timings computed earlier. When 

! n is small, ST is greater due to the weak vectorization ^ rylogg nj . As n gets 

1 larger, this fact becomes less important, but then the inconsistency of the vectorization 

^ logg n termj begins to dominate. 

| It is clear that RT would be superior to ST for ail n regardless of conver- 

I gence rates since they are vectorizations of the same method. The choice between the 

I others, of course, will be influenced by the total number of steps required to reach steady 

| state. Table VI presents the predicted normalized computer run time of the three best 

! vectorizations for the grid sizes for which the number of steps to convergence (table I) 

i are known. It is of interest to note that ADI, which was the fastest serial method, is 

i 

now the slowest with respect to the STAR computer. The PI and BR vectorizations 
; are the fastest and are nearly equivalent since the longer time per iteration of PI has 

been offset by its fewer steps to convergence. It should be noted that if the trend shown 
in table I in going from n = 13 to n = 27 continues (that is, as n doubled, each method 
took about four times as many steps), ADI with RT will approach the other two in 
( STAR timing since its vectorization becomes relatively better as n — °°. (See fig. 8.) 

However, as n gets large, the poor program locality referred to earlier becomes more 
important even though it doesn't show up in the timings. 

Although it is currently impossible to say how important this effect might be, an 
alternative is suggested for the ADI here to offset this effect should it prove to be large. 
The poor program locality is caused by the need to transpose the vorticity and stream 
function values. If this can be avoided, then so is the locality problem. It is recalled 
from the discussion of the two ADI vectorizations that ADI with RT required 
storage of the grid opposite to the direction of implicitness, whereas ADI with ST 
required storage in the same direction of implicitness. The following algorithm may 
be worth considering for the ADI: Let the grid be stored columnwise. 
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(1) At step 2K + 1, Let the implicitness be in the x-direction. Use the RT 
vectorization. 

(2) At step 2K + 2, Let the implicitness be in the y-direction. Then, use ST or 
scalar solution of the tridiagonal system, or whichever tridiagcnal solver proves best to 
solve each column individually. (Note that even with using the scalar code to solve the 
systems, some vectorization is present since the coefficients in the matrix equation can 
be computed in the vector streaming mode.) 

(3) Go to step (1). 

The necessity for the transpose has been removed and, hence, the locality has been 
improved. Also, a costly operation (8n^) has been eliminated from the timings for the 
method. The timing for such a method would be approximately the average of the two 
ADI vectorizations minus 8n2 for the transpose operation. 

CONCLUDING REMARKS 

From the viewpoint of seeing as many facets of the vectorization process as pos- 
sible, the following benefits were obtained from this study of a specified fluid flow 
problem: 

(1) Examples of optimal vectorizations (Brailovskaya’s method (BR) and method of 
partial implicitization (PI)) and the importance of control vectors in achieving this 
optimality. 

(2) Examples of two completely different approaches to vectorizing a sequential 
problem (solving a series of tridiagcnal systems of equations): 

(a) The first approach (ST) uses a new mathematical algorithm by Stone to 
induce vectorization. It also demonstrates the effect of an inconsistent vectorization on 
the STAR computer. 

(b) The second approach (RT) takes advantage of the repeated and independent 
nature of the task to obtain the vectorization using the usual serial algorithm. Both 
approaches serve to illustrate the importance of data management. 

(3) An example of a new method (that is, PI) whose theoretical properties are most 
advantageous on the STAR computer. 

(4) A feeling, in an order of magnitude sense, for the effect of different degrees of 
vectorization. 

(5) The dependence on the number of grid points of the relative efficiency of the 
several methods. 





(6) A suggested approach to eliminate the locality problem in the alternating direc- 
tion implicit method (ADI). 


Although it is not reasonable to draw global conclusions about results generated 
from only one problem for only three different grid sizes, some conclusions can be 
reached about the three methods (BR, PI, and RT) for this problem, ho a comparison 
of the two stable (theoretically) methods, PI performed almost as well as ADI as 
regards number of steps and since the \ ectorization is better for a small number of grid 
lines in each direction and has no locality problems for a large number of grid lines in 
each direction, PI would be preferred. PI has a slight advantage over BR in the 
vectorized form but BR will be easier to adapt for less rejjular regions. However, 
the theoretical stability cliaracteristics of PI makes it an interesting method to con- 
sider for use with the STAR computer. 

Langley Research Center, 

National Aeronautics and Space Administration, 

Hampton, Va., March 29, 1974. 





APPENDIX A 


A STAR CODING FOR THE BRAILOVSKAYA METHOD 


The program listing for Brailovskaya’s method is presented in this appendix and 
uses an assumed FORTRAN - like language with extensions for vector operations. 

C 

C BRAILOVSKAYA M-iVHOC 

C 

C A ONE-OIMENSIONtO VECTOR IS USED 
C 

C ZO RESULT VOHTICITY VECTOR AT TIME, T 

C ZBftR INTcRMEOIATl result vector 

C Z RESULT VECTOR AT TIMe, T * DELT 

C NSW TOTAL NUMBER OF ELEMENTS IN VECTOR (INCLUDES BOUNDARIES! 

C N NUMBER OF cLcMENTS IN UNe COLUMN OF GRID 

C BZ BIT CONTROL VECTOR WHICH PRGHieiTS STORAGE ON BOUNOAR I E > 

C THL Z BOUNDARIES ARE COMPUTED FIRST, WHEN THE INTERIOR Z ELEMENTS 

C ARE COMPUTED BZ OOES NOT ALLCW THE NEW Z ELEMENTS TO BE STORED 

C AT THE BOUNDARIES 

C PSI STREAM FUNCTION VECTOR 

C TEMP, T2, T 3 ARE TEMPORARY VECTORS USED IN THE CALCULATIONS 

C H * 1/ IN— I J 

C 

DIMENSION ZINSUi , ZJ(NSQ), Z8ARINS0J, PSIINSOI, TEMPCNSOI 
l , T2INSOI< TbINSOI 
BIT BZ(NS'O) 

C 

C COMPUTE CONSTANTS 
C 

HSJ * H*H 

CONI = Dc.LT/HSW 

CUN3 » R * DELT/ C4.0*HSQ) 

CUN5 * 2.0/HSW 
NMl-N-l 
C 

C COMPUTE OFFStTS 
C 

C BEGINNING SUBSCRIPT 
M3 * N*l 
MC * N*2 
MA * N + 3 
MR * MC*N 

C ENDING SUBSCRIPT 

NC * tN-l ) *N - l 
Nd * NC-l 
N~ * NC* 1 
Nr * NC+N 
NL * NC-N 
IDO CONTINUE 
C 


l 
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APPENDIX A - Concluded 


C COMPUTE PSI 
C 

C EVALUATE BOUNDARY CONDITIONS 

2<2:.JMl> * -CuN5* PSI <2 SNM1I 
NN * NA*2 
NT ■ NL + 3 

2INT:NRJ * -CON5* PSIINlJNCT 
Z8ARI2SNMI) = 2123NML1 
ZBARINNSNR) » ZINN:NR) 

Nl Ml * NI-l 

Z(M8:NIML:N» *-CUN5*PSl (MC:NL:N» 

NN* 2*N 
Ml * NN* 1 

ZINNJNASN) * C0N5M-PSI INl:NCs N) ♦ HJ 
ZB AR( MB :N1M1:n1 * Z(MB: MM1 :NI 
28AR(NN:nA:N ) * ZlNN:NA:N) 

EVALUATE TEMPORARY VECTORS 

T2IMC:NC>= PSIIMA:NA» - PSICHBSNBl 
TJ IMG: NCI * PSI (MR: NR) - PSI(2:NL» 

TEMPI MC: NC ) * ZOIMCtNCJ + GONT* IZOIMftsNR) - 4.0*20 IMC:NC> * 20I2:NL) 
I *ZOt MAsNAl * 2u(M0:N81l 

COMPUTE I NTEfi MEDIATE STEP 

2BmR(MC:NCJ = B21MC:NC) .CTRL. ( TEMP(MC:NC1 - C0N3* IT2(MC:NC»* 
l 12 J I MR: NR J -20 C 2 : NL I l -T3IMC:NCl* t20tMA:NA)-2Q(M8:NB» J >5 

COMPUTE SULJTIOM RESULT 

2IMC:NCI = BZIMCsNC) .CTRL. I TEMP(MC:NC» - C0N3* IT2(MC:NC) * 
l I Zb aR I MR :NR J - 2UAR{2:NLJ ) - T 3 IMC: NC J * ( 2B AR (M A:N A WE AR ( MB: N8 ) J )» 
cND 
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APPENDIX B 

A STAR CODING FOR THE METHOD OF PARTIAL IMP LICOTZATION 


The program listing for the method of partial implicitization is presented in this 
appendix and uses an assumed F ORTRAN - like language with extensions for vector 
operations. 

HSii » H»*2 

CON = R*l)cLT * -1 /2. 0 
F = — HSQ — i.O^UiLT 
F3 Q = F**2 

OXHt.:MC>= Cu.M * (PSUMi»:.«J - PSi(ML:NL)l 
JVChCSMCJs CJ» * (PS) lM„:f.A) -PSItMB:KS>> 

UtMCtNCi = or (Me SMC) ♦ Dc.LT 
cMCSNC) =-L*Y(Me SMC) *■ OcLT 
o(MC:NC) = OX(MC:NL) *■ OCLT 
HtMeINC) =-JX(ML5NLI ♦ CELT 
Tc MP{ MLMl Ifs-MMI ) * -H3G* ZC ( MLM1 : MMI ) 

COMPUTE CUE FF ILltNTS ANl) TERMS UScC IK TFt GENERAL EQUATION 

0Z(2:NLPl ) =l)( ML:NA)*Za(2:M.Pl) 
tZ(.MRMl:N2k)=i ( >1bSMK) *ZC(”k-Ml: N2F ) 

GZt.HLPl:NRPl) =G(ML:NR)*Ze(MLPl:KRPl) 

HZ (MLMl S.NR.'.l j = H< ML: •NR) *ZL (MLMI JNRM1J 

AMMC:NC) = TertPMAJNA) -CZ ( ML P l : KLP 1 ) - EZ(MkPl:NRPl)— GZ(M2A:N2A) 

AL ( He : NC ) - T -MR ( ml: Me) -L Z ( M2 L : N2L ) -GZ ( MLP l : Nl P l ) -HZ ( ML M l : NL« i ) 

AN (Me : NC ) = T i MP ( MR : Nk ) -EZ ( M2 R :N2S ) -GZlMPPlsNPP l ) —HZ ( MKMl tNRMl ) 

Art(M£:NC) = T. hp ( MC: NC ) 

JIMC:NC) = TchP (McWutJ) - u)Z (ML Ml: NLM l J ~i Z( PPM l :N»M 1 )- HZ(M23:N2B> 

Je(MeSNR) =|J( teS itR) *:(MLSNC) 

HJtn t:NH)= f< ML": Mm) *G(HoSNC) 

0<sMJ4(Kl:NC) = Je(MK:»|R) + HG(MC:NC) ♦ DE(MC:NC) * HG(HtcNA) 

• ANUrt(MC:r.C) = J 4( MC:<MC ) *C(ML: Me ) ♦ AC < M C:NC ) *H (MC : NC )+AL ( MC: NC ) *31 MC: NC ) 
l + »,*( mc : 'ML > * t> (,‘1 C:(mC) 

CUMPJTl COE FFICICnTS CUP the points ACJACEVT TO THF BPJN'JARIES 
(TWO-OIM.flSlL.-MAL NIT..TIUN IS US^O HEWt F OR tASlEF RfcAOING) 


FOUR eUkjLRS 
TOP LeCT 

AM (N- 1 » 3 ) = *M(fM-I,3) — G (N— 1,3) *ZO(N» 3) -0 ( N- 1 , 3 ) *Z C ( N- i , 2) 
Gl.M-1, a ) = 0. 0 
D( N-l,3)=O.0 
C T TP RIGHT 

AM(N-l,N)= HOM-UN) -GIN-l ,N) *ZO(N,N) -E t N- 1 , N ) *2 C ( N- 1 , N+ l ) 
£ ( N— 1 » N > =J.U 
C BOTTOM LEFT 

AM (2,31 s A M( 2 , 3 ) —0 (2,3) *Ze(2»2) -H I 2 , 2 ) *Z0 ( l , 3 ) 

0(2,3) =0.0 
H(2,i)=0.0 
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APPENDIX B — Continued 


METHOD OF PARTIAL IMPLIC IT l ZAT ION 

ONE-DlMENSIONcO VECTOR IS USEE 

ZO RESULT VQRTICI TY VECTOR *T TIME. T 

Z RESULT VECTOR AT TIME, T + CELT 

M NUMBER OF ELEMENTS IN ONE COLUMN OF GRID 

N2 TOTAL NUMBER OF ELEMENTS IN VECTOR « N**2 ♦ 2N 
(INCLUDES BOUNDARIES AND APPENDED COLUMNS) 

PSI STREAM FUNCTION VECTOR 
H =l/(N~l) 

OX, DY ,U,E,G,H,T£ MH , AK , A L, AM * AN»fI , DeN'UM , ANUM TEMPORARY VECTORS USED AS 

TERMS IN THE GENERAL ECUATICN 

HI BIT CONTROL VECTUR WCHICH PROHIBITS STORAGE ON THE BOUNDARIES 

THL l BOUNDARIES ARE COMPUTEC FIRST, WHEN THE INTERIOR Z ELEMENTS 
AtE COMPUTED 61 DOES NOT ALLOW THt. NEW Z ELEMENTS TO BE STORED 
AT The BOUNDARIES 

DIMENSION li M2) , Z0(N2) .TEMP IN2 ) ,DY IN2 • , DXI N2 ) , D( N2 ) ,EIN2) , GIN2) .HI N2) , 

I AKI N2) »mL( HZ ), AM (W2) , ANI.N2) , OIN2) , DENOMI N2 ) . ANUM C N2) 

1 ,0Z(N2) ,EZ(N2) «£Z(N2J ,HZ(N2) ,C£(N2) t HG(N2) 

BIT BZIN2) 


COMPUTE IFFSl T S 

BEGINNING SUBSCRIPT 

MC « 2*N«-2 
Mb » MC-l 
MLPl= MA-N 
ML * KC-N 
MLMI * Mtt-N 
MRM1 * MbHN 

cNOINo SUBSCRIPT 

NC * N**2 - i 
N A * MC + l 
NR * NCHN 
N2r * UR ♦ N 
NRPI= NK+l 
SRH1= Ni\- l 
NL PI * NL ♦ 1 


c 



BOTTOM RIGHT ' 

AM(2,N>* Artt2,N)- e 1 2»N)*Z3 (2» N ♦! )— HI 2,N I *ZDl 1»N+ U 

fcIZit hii.O ■ 

H(2,N )*0.0 •.! 

C ! 

C BOTTOM P TINTS AOJmCLNT TO BOUNDARY AND TCP i 

C • 

do 200 4 * 4 , n-i ; 

AM ( 2,4 ) * AM ( 2 , 4 ) -HI 2,41* 2011, J) 

HI2,4J > 0.0 

AMIN-1,41 = AM{\-I.,4) - GIN-1,4) *Z0<N,4> 

G(N-l,4J *0.0 
200 CONTINUE 
C 

C RIGHT AND LEFT POINTS ADJACENT TO BOUNDARIES 
C 

00 210 I=B»N-2 

AMU, 3)- AMU, 3) -D(I»3)*ZOU»2l 
0( I ,4)*0.U 

AMtI,Nl* AMII.NI- £( !,Nl*ZU(I,N + U 
Eli »N) =0. 0 
210 CONTINUE 
C 

C COMPUTE GfcNtFAL EOUaTION 
C 

2( MCiNC I “B.ZIMC :nC I .CTRL.! I F*£M IMC:NC J-ANUMI MC SNC I l/IFSO-DENOMIMCtNC) I J 
cNO 
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APPENDIX C j 

i 

A STAR CODING FOR THE ALTERNATING DIRECTION t 

IMPLICIT METHOD BY REPEATED TASKS 


The program listing for ADI by repeated tasks is presented in this appendix and 
uses an assumed FORTRAN- like language with extensions for vector operation**. 

C 

C AOI BY REPEATED TASKS 

C 

C PSI STREAM FUNCTION VECTOR 

C l VOS. TIC I TV VECTOR NOW CuMPLTING 

C UfcRl , DER2 PSI DtR I VAT I VC VECTCRS 

C A,c$,C,0 COcFFICINTS IN MATRIX ECUATICN 

C 9 IS A. CONSTANT 

C G,F,W TEMP VECTORS AS IN THOMAS ALG 

C T TEMPORARY FOR VORTICITY 

C 

C N NUMdcR OF INTERIOR McSH PTS IN 1 LINE 

C d REYNOLDS NUMBER 

C H SPATIAL GRIJ SPACING 

C OT TIME STEP 

DIMENSION PSI (M,M»,Z(M,MJ, J£RHM,N),CER2CM,N), 

I AlNI,CtW»,U(NJ ,W(N) ,F(N-l,NJ ,G(N,NJ ,T(N»NJ 
LOGICAL COLwlSE 

HSQ=rt*H 

CUN*JT*R/<*. 

CONl«-2./HS'J 
6 = — 2. *OT-HSO 
NP1 = N«-I 
H*N*2 
■*M1=N-I 

CJN2= 2. *DT— HSU 
COL wl SE=.T. 

c 

C COLWI SE=. T. IMPLIES PSI,Z STOREC BY CCLS OF GRID 

C THUS, IMPLICIT IN X-OIRECTION 

U CONTINUE 
C 

C COMPUTE BJUNOakY VALUES UF Z 

C 

IF { . NC T . CJLWISfci GO TO 10 
C 

C PSI,Z STuRcU 6Y GRIJ COLS 

C ZC2JNP1, II REFERS TU FIRST GR 1 0 COLUMN 

£ 

ZI2SNPI, 1>=CCNI*PSU2:NPI,2 J 

I I a MP1,M)=CUN1*PSI 12 CNPl ,NPIJ 
U0 5 J=2,NPi 

l ( i,JJ=CLNI*PSI(2,J» 

5 ZC M, J) =-LUNi* (ri-PS ININP l, J J I 
GU TO 9 
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APPENDIX C - Concluded 

10 CUNT I NUC 
C 

C PSl,Z 5TUREJ «Y GRU ROWS 

C ZI2:NP1.,U REFERS TU FIRST GRID RCW 

C 

Z(2SNP1.,1) =C'JN l*PSII2sNPl*2J 
Z(2:NP1»M) =-CUNl* (H-PS M2*NPl» NPl)) 

DO 4 J = 2 » NPl 
Z( l,4»=CoNi* PSK2.JI 
4 Z ( M, J ) =Ct‘NZ* PSUNPltJ) 

•• CONTINUE 

Dck1( 1:M*N|=(PSI ( 2*^<- 1: M*HJ -PS I Q:M*m- 2*M H*CUN 
DtRZI l:M*NJ*(PSl ( H+-2 s IJ-PSI IWK*H-H-1 ) )*CON 
C 

C IF CULrtI SE»0LK1*(R*DT*H/2J*D/0X (PSD 

C IF NUT CUU»ISZ,0cRi=-IR*0T*H/2) ♦D/DYIPSI) 

C 

C BEGIN TF. 1-DlAU ScTUP ANO SDL 

C 

CON3=l./B 

Wll:N)=C0N3 

U(lsN)*Z(2JNPl,2)*CUN2 * Z( 3:M , 2) *<-OER 1( 2 :N«-l « ’ ‘-0T ) 
l ♦ ZIt:N,2)*U>cAU2:NPl,l)-DT J-Z 12 s NPl, 1) * l DEP2 1 2 :NP l , 1) +DT ) 
GI i5N,l)=DIl:NI*4tl:N) 

00 1 J= UN, HI 

CI1:-MI= UT-JtK2( 2SMPI, J I 
Fti:N,JI >v,(i:fc)*W(lSN) 

J( 1:'U = Z12:NPI,J+2J *CJN 2+Zt j:m , J+ 2J* I -DERI 1 2 :NPl , J + l l-DT) 

1 *Z( UN, J*2J*USRII2:NPl, J + IJ-DT) 

It- l J .to. !J,ni Q( l:WI -01 liNI-Z<2:NPl,M)* (-DEP2I 2 tNPl,N i *0T t 
A( 1:N)*JT« jfc-U (2: NPl, J+U 
T«-HP= B- a ( I ) *F ( 1 :N, J J 
WI l:N)=i./TjMPU:N) 

G( lJiN,J + l< = (UI IJ N) -A I l:N)*G(l:N,J)}*Mtl:N) 

1 CONTINUE 
C 

C BEGIN BACK SUB ST I TUT I JN 

C 

T( UN,M> = G ( 1 : N, N ) 

OU 2 J=l,N1l 
L * N-J 

LP1 =1+1 > 

2 TC l : N » L J ~ GU : N, L I -F U s N, L ) * T{i:N,LPI) 

C 

C Rc ARK AN Gfc Z 

C 

DO > I =2, NH I 
$ Z I 2SNP 1 , I J= T ( I , l : N) 

COLtfl SE= ..■JOT. CL'LNISt 
LON=-CUN 
C 

C IF CON GT 0 TtST FOR CONVERGENCE 

C CALL PUISSON SUlVER TO GET FS» 

C 

GO TO II 
END 
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APPENDIX D 


A STAR CODING FOR TEE ALTERNATING DIRECTION 
IMPLICIT METHOD USING STONE’S ALGORITHM 


The program listing for ADI using Stone’s algorithm is presented in this appendix 
and uses an assumed FOR TRAN - Like language with extensions for vector operations. 

C 

C 

C ADI BY STuNES RECURSIVE DOUBLING 

C 

C PSI STRcmM FU..CTIUN VECTOR 

C l VORTICITY VECTOR NCh COMPUTING 

C DEN It DEK i PSI DcR IVATI VE VECTORS 

C A * b , C » D COEFFlClNTS IN MATRIX EQUATION 

C B I S A CONSTANT 

C T ' TEMPORARY FOIL VORT I C IT Y 

C UI ,UIMI ,OlM2,AC, TEMP,X, Y TEMPORARY VECTORS NEEDED FOR 

c Recursive doubling 

C M UPPER DIAGONAL OF FACTORED MATRIX 

C OR EC 1 P RECIPROCAL OF EACH ELEMENT IN LOWER DIAGONAL 

of factored matrix 

C M NUMBER OF INTERIOR MESH PT S IN 1 LINE 

C R REYNOLDS NUMBER 

C ri SP AT I A*. GRID SPACING 

C JT TIME STEP 

0 1 Ml NS I ON PS l ( M ,M J ,Z I M , M J , DER II P, N ) , 0EP2 IH , N J , 

1 \IN),CtN), J<NJ , ACm,OIlN)»OIMUO:NJ,QIM2{-UN) , TEMP INI 

2 URECIPlNJiM(N) » X( N) » Y (N) t T ( N «N I 
REAL M 

LOGICAL COLWISE 

HSO=H*H 

CUN=0T*P/4. 

CONl*-2 ,/HSJ 
NPl = N+i. 

M*N«-Z 

NMl=N-l 

B=-2.*DT-HS0 

BSQ= 3*b 

CUN2= 2»*DT-HSQ 

COLW I SE = . T. 

C 

C COLWI SE= . T • IMPLIES PSI, 2 STCRtD BY CCLS OF GRID 

C THUS, IMPLICIT IN Y-OI RECTI ON 

11 CONTINUE 
C 

C COMPUTE BOUNDARY VALUES GF L 

C 

IF (.NOT. CJLWlSE) GO TO 10 
C 

C PSI , L STORED 3Y grid COLS 

C 2I2:NPltlJ KtPtnS TJ FIRST GR 1 0 COLUMN 
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APPENDIX D - Continued 


I-l 

21 CONTINUE: 

1 - 1*1 

IF ( I .GT. N/2) GO TU 22 
TEMPI I -1 :N)=«Jl <I-l:N)*aiMl(0:N-I*lJ 
1 +ACI1:N-I*2)*QIM2I I-lsN )*0 IP2(-I:N-II 
01MU I:N) = 01 < I: N)*CIM1 IO!N-I) *AC{l:N-I*i>*QlM2I-l:N-I-l> 
1 *QIM1(I:N) 

01 M2 CI-lJN) =T EMP ( l-ll N) 

OK [+1: N) = U*UlUllsN-ll*ACt I *1 :NI *CIH2 I I-l : N-2) 

GO TO 21 

22 CONTINUE 

URECiP(2tNJ=0( l:N-ll/0(2iM 
M( 2: N)-A(2sN)*JRcCIP( 1:N-1) 

C 

C FORWARD S JUS T I TUT IUN 

C 

Y11:n)= U(l:M 
TEMP«2:N) — M(2sN) 

1-1 

20 CONTINUE 

1F( II .GT. N/2) GO TO 30 

Vi I+ltN»=y{I*l:Nl* Y(l:N-I)*TtMP(I*lsM 
TEMPI I*lsN)=T zMP ( I*i:NI *TEMP(lsN-I ) 

I- I *1 
GO TO 20 
C 

C 6ACk SUBSTITUTION 

C 

30 CONTINUE 

X(l:N)= Y < l:N) -JREC IPI 1 :N) 

T«.MP< l:N-l)=-C( 1 :N— 1) *UK£CI P( 1 s N1 
1-1 

40 CONTINUE 

IF I I .GT. N/2) GO TO 50 

XI l IN- 1 ) = X( ♦ XI I*l:N)*TEFPll:K-I) 

TEMPI I:N-I )= TEMP 1 1 tN-I ) - TEMP (I*1:N) 

I- I *1 
GO 10 40 

50 Tt l:N.JI = Xll:.\) 

C 

C INTERCHANGE ROWS AND COLS 

C 

DO 3 1=2. NP1 

3 2 I 2s M-l . I )= Tll-l,l:Il 

Z( 2:NP1‘, i) =C0N1*PSI (2JNP1.2 J* 
Z(2:.\IPl,f)=C0M*PSIt2:NPl t NPl) 

DO 5 J=2,NPl 

Z< 1. J )=CLN1*PSI I 2. J) 

5 Z<M,J)=-CCNl*lH-PSININPl, J) ) 

GO TJ <) 

10 CONTINUE 
C 
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APPENDIX D - Concluded 


PSItZ STOKcl) «3Y GRID ROWS 
Zt2:NPl,L» KCFERS Td FIRST GRIO RCW 

Z( 2:NP l, U =CO"U*PSI (2 :NP 1 1 21 
Z ( 2 SNP1 » M ) =-CUNl * IH— PSII2SNP1, NP 1)1 
00 4 J*2»NP l 
Z( i»J)*CCN2* PSI C2»J) 

4 Z ( Mr J ) »CUN2* PSI(NP1,J) 

9 CONTINUE 

DERI I l !M*N ) » (P SI I 2*M+l! M*M) — PS I II SP*M-2*M } >*CON 
Oc R2 1 l : M* N » = ( P S I I M+2 : M *M-M+ li-PSl ( MsM +M-M- 1 J ) *CON 

IF CJLWISErDERl* U*OT *H/2 J*D/OX IPS II 
IF NJT CCiLWlSt »D£Rl*-(P *0T*H/2 )*0/DYtPSII 

’ BEGIN TR I— 01 AG SETUP AND SOU 

00 I J-ltN 
JP !*.!«■ I 
JP2*J+2 

0(i:NJ=Z(2:NPl, JPII*CUN2 +Z I/s NP i , JP2 l*(OER ( 2 :NP 1 , J I-OTJ 
l +Z 1 2 !NPl * J > * t -06R 2t 2SNP I « J ) -DT t 

01 l»»0(ll-Z{i, JPl J*(-0ERI12»J »«-CT J 
UtNl®0tNJ- ZIP., JPD* IDERtlNPl, JI+CTJ 
CIl NMU=OT ♦OiRl(2sN,J) 

At 2SNJ * UT- 0ERII3 NPl.J) 

AC ( 2 : N ) - -M2:Nl*CtlsNMl> 

AC 1 1) =0« 

QIM2t-U NI=l. 
aiKlt i: N) * a 
Ql Ml (01=1. 

gi { 2 tNi = SSQ + AC( 2 • N ) 

QI ( 1 J = 13 

BEGIN FACTORIZATION 

CULWISE= .NOT. CuLWISc 
CON®— CUN 

IF CUN GT 0 TEST FUR CONVERGENCE 
CALL PGISSJN SULVc* TO GET PSI 

GU TO II 
END 
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TABLE H.- ESTIMATED TIMINGS FOR STAR 
64-BIT VECTOR OPERATIONS 


Operation 

Time in clocks 

Add, subtract 

33 

Multiply 

38 + V 

Divide 

87 + 2 V 

Transmit 

30 *f 

Transmit index list 

34 + 8 V 


TABLE in.- RECURSIVE DOUBLING AND SCALAR TIMINGS 
FOR TRIDIAGONAL SYSTEM 


Recursive doubling: 

Trd = 15N^ log 2 n + 9.5n - 10N^ + C51 log 2 n - 3 
Scalar coding: 


T g = 190n 


n 

8 

log 2 n 

3 

t rd/ t s 

1.5 


16 

32 

64 

128 

256 

512 

1024 

2048 

4096 

8192 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

1.1 

0.86 
l 

0.72 

0.67 

0.66 

0.70 

0.75 

0.81 

0.88 

0.95 


16384 

14 

1.03 


44 














Method 

Formula for 
STAR time 
in clocks 

Number of vector operations and length of vector 

Add, 

subtrafct 

Length 

Multiply 

Length 

Divide 

Length 

Transmit 

Length 

BR 

15.5n 2 + 799 

15 

n 2 

8 

n 2 

■ 

m 


■ 

PI 

29. 5n 2 + 4n + 1709 

2 

26 

n 

n 2 

2 

14 

n 

n 2 

1 

n 2 

2 

1 

n 

n 2 

RT 

25. 5n 2 + 722n + 142 

9n 

2 

n 

n 2 

8n 

2 

n 

n 2 

n f 

n 

1 


ST 

15nN^ log 2 n + 25. 5n^ 
- 10nN A + 65 In log 2 n 
+ 305n + 142 

7n 

2 

2n log 2 n 
3n(log 2 n - l) 

n 

n 2 

n a 

n a 

7n 

2 

4n log 2 n 
8n^log 2 n - l) 

n 

n 2 

n a 

n a 

n 

a 

4n 

n(log 2 n - l) 

n 

n a 


Transmit 
indexed Length 
list 
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TABLE VI.- PREDICTED NORMALIZED l 

STAR COMPUTER RUN TTMEt 


n 

BR 

PI 

ADI with RT 

13 

136 

113 

192 

20 

414 

295 

448 

27 

560 

430 

612 


^All entries = (No. steps) * (Normalized 
time per step) 
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N-l.j 


KnJ • N-rt.j 

^.j-1 


(1) || <N.j) = 0 

(2) V 2 0 (NJ) = -^ N>j 
From (1). ^ N+l j = ^ N . ltj 

From (2) A< ^ - = 




Therefore, S M . = ~ 

n.J 


Figure 3.- Representative derivation of £ boundary conditions. 















9 131 


5 121 


< im rw 


3 1 10 ^ Tin 


2 9 I 16 23 " 

1 

11 8 15 22 


30 

37 

29 

i 36 

1 

Boundary 


2 49 

56 

’"I 

S3 

NC 




1 






1 

1 

\ 



1 

! 

1 


44 

1 51 

58 


Intorlor 
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result* 


®°0000060 
09 111110 0 
0 9 1 1 l 1 1 0 0 
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(a) Grid arrangement. <b) Bit control vector. 

Figure 6.- Grid arrangement and bit control vector for method of partial implicitization (PI) for n « 5. 
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